- From caasi@ucselx.sdsu.edu Wed Oct 3 20:12:20 1990
- Return-Path: <caasi@ucselx.sdsu.edu>
- From: caasi@ucselx.sdsu.edu (richard)
- Subject: Basics of the TEI, part 1: design goals
- To: bzs@world.std.com
- Date: Fri, 31 Aug 90 8:04:54 PDT
- X-Mailer: ELM [version 2.2 PL0]
-
- Date: Fri, 17 Aug 90 10:46:03 CDT
- Comments: "ACH / ACL / ALLC Text Encoding Initiative"
- From: Michael Sperberg-McQueen 312 996-2477 -2981 <U35395@UICVM.uic.edu>
- Subject: Basics of the TEI, part 1: design goals
-
- This list has had a lot of recent subscriptions in response to the
- announcement that the TEI Guidelines are now available in draft form;
- TEI-L now goes to over 275 addresses. The 600 pre-printed copies of the
- draft, which we originally thought might be a bit too many to get rid of
- in the year before version 2 is ready, may at this rate all be spoken
- for before the month of August is out.
-
- We're happy about all the interest, because it suggests that many
- others agree with the organizers of the TEI that we need methods for
- text encoding suitable for multiple uses of the same texts, for exchange
- of texts among researchers and others interested, for languages other
- than English and scripts other than Latin, and for all kinds of text,
- not only the most common.
-
- This list should play a big role in the revision of the Guidelines,
- and to help get the relevant discussion started, it might be a good idea
- for the editors to discuss from time to time some of the background to
- the current draft -- a sort of TEI tutorial over the net. This will, we
- hope, provoke some questions from participants in the list, and will
- lead over time to discussions of the many thorny technical and other
- issues involved with a project like this. Much of what we say at the
- beginning may seem (or be) basic and uncontroversial, and those who like
- fireworks may wish we would jump right to the burning questions and get
- some arguments going. It appears, though, that some of the
- noncontroversial basics are essential to even understanding some of the
- trickier burning questions, so we are going to go slow at first. Anyone who
- wants to start a second thread on any burning issue of their choice may
- do so.
-
- We count on the many participants in this list who are serving on the
- TEI working committees to jump in and amplify or supplement our account
- wherever they see fit.
-
-
- WHO IS THE TEI FOR?
-
- Let's start with something fairly simple: who is the TEI for and
- what are the basic goals?
-
- The goals of the TEI are to define a format for encoding texts in a
- linear data stream which is suitable for the interchange of textual
- material between researchers, and to provide concrete recommendations,
- for those who can use them, as to what features of texts should usually
- be recorded. As the letterhead puts it, the TEI is an "Initiative for
- Text Encoding Guidelines and a Common Interchange Format for Literary
- and Linguistic Data". Note some non-obvious points:
-
- 1. The TEI came out of the community of those using computers to do
- research on or with texts, and they are our primary constituency.
- That is: literary scholars, linguists, computational linguists,
- historians, philosophers, theologians, philologists, people work-
- ing on machine translation, ... you name it. The publishing
- industry, database vendors, software developers, and others with
- commercial interests in electronic text are interested in the TEI,
- and many are sharing their expertise with us, but they are not the
- *primary* constituency. If research and publishing were to turn
- out to require different things, the TEI would go with the needs
- of researchers.
-
- It's important to note that this is mostly an imaginary issue:
- so far the requirements of all these groups seem astonishingly
- close to identical. Very concretely: I have not encountered a
- single problem faced by humanists which does not have an analogue
- in a problem faced by linguists, and one in a problem faced by
- publishers or commercial database vendors. And vice versa. Some-
- times the problems look different, but so far most differences
- have proven superficial. We believe that what will work for
- researchers must work for other applications as well. So in a
- real sense, though researchers are the primary constituency, the
- real intended constituency is everyone who works with electronic
- text in *any* way, and wants to be able (a) to move the text from
- system to system without information loss, or (b) to use the text
- for more than one thing.
-
- 2. One major intended use for the Guidelines is as a specification
- for an interchange format. Transfer between researchers,
- machines, programs, networks would use such a format very simply:
- as a description of what my text will look like when it passes
- from my hands to yours, or what I would like yours to look like
- when yours reaches me. An interchange format does not tell anyone
- what to encode, any more than the ASCII code tells us how to write
- novels or manuals. What is encoded is the intellectual responsi-
- bility of the researcher; no one can take that responsibility
- away.
-
- 3. The other major intended use is as a guide for those encoding
- texts for general use (and one hopes that that includes most of
- those encoding texts). The Guidelines should provide a sample set
- of textual features that many people have found useful in textual
- work, together with ways of encoding those features. No one is
- required to encode all those textual features, but the list should
- (if we do our work right) be taken seriously as a checklist of
- what the community as a whole tends to find useful.
-
- Software developers should also benefit from the Guidelines in both
- these ways: as a definition of an export-import format (or as an inter-
- nal file format, if you wish!) *and* as a checklist of textual features
- commonly thought important. I suppose many of us have seen software
- which suffered from its makers' sometimes unconsciously narrow concep-
- tion of the kinds of texts it would be used for -- the Guidelines should
- be useful as a sort of brain-storming, concept-broadening tool for
- developers.
-
-
- 1.1 Basic requirements
-
- The basic requirements for a text encoding scheme have been stated in
- the NEH proposals for TEI funding. (Quick tip of the hat to the NEH,
- the EEC, and the Mellon Foundation for their funding. Without them, it
- wouldn't be happening nearly as fast.)
-
- An encoding scheme is any (systematic?) method of representing or
- encoding textual data in machine-readable form. Typically, an encoding
- scheme must include:
-
- 1. methods for recording the characters in the text (including dia-
- critics, special symbols, non-Roman alphabets, etc.)
- 2. conventions for rendering a text in a single linear sequence
- (specifying how footnotes, end-notes, critical apparatus, parallel
- texts, and other non-linear complications are handled)
- 3. methods for recording logical divisions of texts (e.g. book, chap-
- ter, paragraph; act, scene, speech, line; ...)
- 4. methods for recording analytic information like literary or lin-
- guistic analysis
- 5. conventions for delimiting in-line comments and other ancillary
- material
- 6. conventions for identifying the text being encoded and those
- responsible for encoding it
-
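- As a rough illustration of items 2 and 5 (this sketch is not taken from
- the Guidelines), the Python fragment below folds footnotes into the
- running text at their point of reference, so that a paragraph plus its
- notes becomes one linear character stream. The `[n]` reference markers,
- the `Paragraph` class, and the `<note n="...">` delimiters are all
- assumptions invented for the example.
-
```python
from dataclasses import dataclass, field


@dataclass
class Paragraph:
    """Hypothetical in-memory form: running text plus out-of-line notes."""
    text: str                                   # contains markers such as "[1]"
    notes: dict = field(default_factory=dict)   # marker number -> note text


def linearize(par: Paragraph) -> str:
    """Replace each "[n]" marker with the delimited text of note n."""
    out = par.text
    for num, note in par.notes.items():
        # The <note> delimiter is illustrative only, not a tag quoted
        # from the TEI draft.
        out = out.replace(f"[{num}]", f'<note n="{num}">{note}</note>')
    return out


par = Paragraph(
    text="First sentence.[1] Second sentence.[2]",
    notes={1: "Note on the first.", 2: "Note on the second."},
)
print(linearize(par))
```
-
- A real scheme would also have to say how delimiter characters inside
- the text itself are escaped, which this sketch ignores.
-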
- To create a single encoding scheme suitable for common use, the TEI
- first formulated (in the original planning conference in 1987 and in
- working papers since) the following requirements for the scheme to be
- developed:
-
- 1. It should specify a common interchange format.
- 2. It should provide a set of recommendations for encoding new textu-
- al materials.
- 3. It should document the existing major schemes and investigate the
- feasibility of developing a metalanguage in which to describe
- them.
- 4. It must be a set of guidelines, not a set of rigid requirements.
- 5. It must be extensible.
- 6. It should be device- and software-independent.
- 7. It should be language-independent.
- 8. It should be application-independent.
-
- As design goals, it was specified that the guidelines should:
-
- 1. suffice to represent the textual features needed for research
- 2. be simple, clear, and concrete
- 3. be easy for researchers to use without special-purpose software
- 4. allow the rigorous definition and efficient processing of texts
- 5. provide for user-defined extensions
- 6. conform to existing and emergent standards
-
- We can expatiate on these, if anyone isn't sure what we mean by
- them, but I won't here.
-
- The current draft, be it noted, does *not* solve all these problems
- or wholly fulfill all of the design goals. It wasn't expected to --
- some of the hard problems were intentionally saved for the second cycle.
- Here is my personal checklist of where we stand with respect to the
- goals listed above (which as you can tell from the overlaps were taken
- from different documents).
-
- * The current draft (version 1.0) does specify both an interchange
- format and recommendations, though perhaps not as explicitly as one
- might have expected. It may need to become more explicit in defin-
- ing the interchange format.
-
- * It does not document any existing encoding schemes, though work is
- continuing on that topic.
-
- * The metalanguage and syntax committee did consider the formulation
- of a metalanguage for defining existing schemes, but decided against
- it. Descriptions will take the form of prose and of algorithms for
- translating from a given scheme into the TEI scheme, using a variety
- of existing software tools (e.g. sed scripts, Rexx execs, Snobol
- programs, or even yacc and lex code).
-
- * It is certainly a set of guidelines rather than requirements, and
- device- and software-independent. It is also, however, not fully
- implemented in software -- this has the advantage that the design is
- not unduly biased by implementation issues, but it makes it hard to
- demonstrate or validate the scheme.
-
- * It is extensible, but the mechanisms for specifying extensions need
- work to be usable without heavy-duty knowledge of SGML.
-
- * It has no bias that we have consciously put there in favor of any
- one language, but the TEI has not addressed, let alone solved, the
- problems of languages other than those already most effectively cov-
- ered by international data-processing standards. The current draft
- is silent on topics where people need the most guidance: older
- forms of languages not covered by ISO standards, Asian scripts,
- treatment of bidirectional text (e.g. Hebrew and English), and so
- on. We expect to work on these in the next two years, but for some
- issues there is little we can do but document and call attention to
- existing methods of handling these problems (e.g. ISO 10646 or the
- Unicode effort -- two unfortunately incompatible approaches to han-
- dling Chinese and other Asian scripts).
-
- * It does provide what we think is an adequate *basis* for handling
- all the known needs of research; it probably needs extension in many
- areas to provide not just the *basis* for the required solutions,
- but some version of the solutions themselves.
-
- * It's as simple and clear as we could make it, but we expect to hear
- about lots of obscurities in the draft. (Let's say it again--please
- let us know if there are things that aren't clear!)
-
- * It can be used without special software, at least at the simpler
- levels. A lot of work is needed, however, before we have something
- we can hand to the average literary scholar who uses Nota Bene or
- WordPerfect or Microsoft Word and wants to create a TEI-conformant
- file. (Volunteer macro-writers sought!)
-
- * So far, at least, the Guidelines can be used as specified in ISO 8879,
- the standard which defines SGML. For technical reasons, however, the
- TEI Guidelines may not be definable as a "conforming application" of
- SGML; these reasons mostly relate to syntactic freedoms of SGML which
- the current version of the Guidelines forbids.
-
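- The translation approach mentioned in the checklist above (an algorithm
- that converts text from an existing scheme into the TEI scheme) can be
- sketched in a few lines of Python, one possible tool alongside the sed,
- Rexx, Snobol, or yacc/lex options named there. The source scheme here is
- invented for the example: `@chapter N` on a line by itself opens a
- chapter, and blank lines separate paragraphs. The output tags
- (`<div type="chapter">`, `<p>`) are likewise illustrative and are not
- quoted from the draft.
-
```python
import re


def translate(source: str) -> str:
    """Convert the hypothetical '@chapter' scheme into tagged markup."""
    out = []            # output lines
    para = []           # lines of the paragraph currently being collected
    in_chapter = False

    def flush_paragraph():
        # Emit the collected lines as one paragraph, if there are any.
        if para:
            out.append("<p>" + " ".join(para) + "</p>")
            para.clear()

    for line in source.splitlines():
        line = line.strip()
        match = re.fullmatch(r"@chapter\s+(\S+)", line)
        if match:
            flush_paragraph()
            if in_chapter:
                out.append("</div>")        # close the previous chapter
            out.append(f'<div type="chapter" n="{match.group(1)}">')
            in_chapter = True
        elif not line:
            flush_paragraph()               # blank line ends a paragraph
        else:
            para.append(line)

    flush_paragraph()
    if in_chapter:
        out.append("</div>")
    return "\n".join(out)


sample = "@chapter 1\nFirst line\nsecond line\n\nNew paragraph\n@chapter 2\nLast"
print(translate(sample))
```
-
- A prose description of the source scheme would still be needed next to
- such a script, since the code records only how the conversion is done
- and not what the original markers were meant to signify.
-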
- That's it for the basic goals of the TEI. Coming up: discussions of
- SGML basics, the TEI tags for core structural features, other core tags
- in the TEI scheme, and character-set issues. After that, we should be
- able to raise some of the more advanced questions.
-
- -Michael Sperberg-McQueen
- ACH / ACL / ALLC Text Encoding Initiative
- University of Illinois at Chicago
-
-
-